{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "ca7baf0d-4eab-4f9c-9b38-31be8547925b",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Default Rate Estimation using LightGBM on Spark\n",
    "\n",
    "### Introduction\n",
    "`LightGBM` is a popular machine learning library in data competitions and industry because of its strong predictive performance and interpretability. In this notebook, we use `Synapse LightGBM`, which runs on Spark and uses cluster computing power to train, evaluate, and tune models, to build a binary classification model for the [Tianchi Competition](https://tianchi.aliyun.com/competition/entrance/531830/information) dataset."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1efadf30-6803-4fb2-89f5-424227cc75d6",
   "metadata": {},
   "source": [
    "### Initialize Spark and Read Dataset\n",
    "\n",
    "In this section, we initialize the Spark session and read the training dataset stored in `${MY_S3_BUCKET}/risk/tianchi/fg_train_data.csv`. Starting the session may take a little longer than usual because Spark needs to download the `Synapse LightGBM` packages. You may need an HTTP/HTTPS proxy server to speed up the download, configured as below:\n",
    "\n",
    "```python\n",
    "spark = pyspark.sql.SparkSession.builder\\\n",
    "    .appName(\"Loan Default Estimation-LightGBM\") \\\n",
    "    ...\n",
    "    .config(\"spark.driver.extraJavaOptions\", \"-Dhttp.proxyHost=<proxyHost> -Dhttp.proxyPort=<proxyPort> -Dhttps.proxyHost=<proxyHost> -Dhttps.proxyPort=<proxyPort>\") \\\n",
    "    .config(\"spark.jars.packages\", \"com.microsoft.azure:synapseml_2.12:0.9.4\") \\\n",
    "    .config(\"spark.jars.repositories\", \"https://mmlspark.azureedge.net/maven\") \\\n",
    "    ...\n",
    "```"
   ]
  },
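  {
   "cell_type": "markdown",
   "id": "a1c0de00-proxy-options-sketch",
   "metadata": {},
   "source": [
    "The proxy settings shown above are a single space-separated string of JVM system properties. A minimal sketch of building that string follows; the helper name `build_proxy_java_options` and the example host and port are hypothetical and not part of the original notebook.\n",
    "\n",
    "```python\n",
    "def build_proxy_java_options(proxy_host, proxy_port):\n",
    "    # Build JVM flags that route both HTTP and HTTPS traffic through one proxy.\n",
    "    flags = []\n",
    "    for scheme in ('http', 'https'):\n",
    "        flags.append(f'-D{scheme}.proxyHost={proxy_host}')\n",
    "        flags.append(f'-D{scheme}.proxyPort={proxy_port}')\n",
    "    return ' '.join(flags)\n",
    "\n",
    "# Hypothetical proxy address, for illustration only.\n",
    "java_options = build_proxy_java_options('proxy.example.com', 3128)\n",
    "print(java_options)\n",
    "```\n",
    "\n",
    "The resulting string could then be passed as `.config(\"spark.driver.extraJavaOptions\", java_options)` when building the session."
   ]
  },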
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "9c59a7a4-8cc1-4367-b5b8-4bd4421f1d87",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pyspark\n",
    "import yaml\n",
    "import argparse\n",
    "import onnxmltools\n",
    "import subprocess\n",
    "import lightgbm as lgb\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import warnings\n",
    "\n",
    "from pyspark.ml.feature import VectorAssembler\n",
    "\n",
    "warnings.filterwarnings('ignore')\n",
    "pd.set_option('display.max_rows', None)\n",
    "pd.set_option('display.max_columns', None)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "b0b31a09-d78a-4f40-b577-2f3ca2904d4f",
   "metadata": {},
   "outputs": [],
   "source": [
    "def init_spark():\n",
    "    spark = pyspark.sql.SparkSession.builder\\\n",
    "            .appName(\"Loan Default Estimation-LightGBM\") \\\n",
    "            .config(\"spark.executor.memory\",\"8G\") \\\n",
    "            .config(\"spark.executor.instances\",\"4\") \\\n",
    "            .config(\"spark.executor.cores\", \"4\") \\\n",
    "            .getOrCreate()\n",
    "    sc = spark.sparkContext\n",
    "    print(sc.version)\n",
    "    print(sc.applicationId)\n",
    "    print(sc.uiWebUrl)\n",
    "    return spark\n",
    "\n",
    "def load_config(path):\n",
    "    # Parse the YAML config file; safe_load avoids constructing arbitrary Python objects\n",
    "    with open(path, 'r') as stream:\n",
    "        params = yaml.safe_load(stream)\n",
    "    return params\n",
    "\n",
    "def read_dataset(spark, data_path):\n",
    "    dataset = spark.read.format(\"csv\") \\\n",
    "        .option(\"header\", True) \\\n",
    "        .option(\"inferSchema\", True) \\\n",
    "        .load(data_path)\n",
    "    return dataset\n",
    "\n",
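    "def get_vectorassembler(dataset, features='features', label='label'):\n",
    "    # NOTE: relies on a `feature_cols` list defined in the enclosing (notebook) scope\n",
    "    featurizer = VectorAssembler(\n",
    "        inputCols=feature_cols,\n",
    "        outputCol=features,  # honor the `features` argument instead of a hard-coded name\n",
    "        handleInvalid='skip'  # drop rows containing null/NaN feature values\n",
    "    )\n",
    "    dataset = featurizer.transform(dataset).select(label, features)\n",
    "    return dataset"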
   ]
  },
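  {
   "cell_type": "markdown",
   "id": "a1c0de01-feature-cols-sketch",
   "metadata": {},
   "source": [
    "`get_vectorassembler` reads a `feature_cols` list from the surrounding scope. One way such a list might be derived is to take every column except the target. The sketch below assumes `isDefault` is the label column and uses a hypothetical helper name, `select_feature_cols`; neither choice is confirmed by the code above.\n",
    "\n",
    "```python\n",
    "def select_feature_cols(columns, label='isDefault'):\n",
    "    # Keep every column name except the label, preserving the original order.\n",
    "    return [name for name in columns if name != label]\n",
    "\n",
    "# A few column names from the training data, for illustration only.\n",
    "example_columns = ['loanAmnt', 'term', 'interestRate', 'isDefault', 'purpose']\n",
    "feature_cols = select_feature_cols(example_columns)\n",
    "print(feature_cols)\n",
    "```\n",
    "\n",
    "With a Spark DataFrame, `dataset.columns` could be passed in place of `example_columns`."
   ]
  },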
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "00327285-50c7-41b8-9a07-cc8defe79210",
   "metadata": {},
   "outputs": [],
   "source": [
    "params = load_config('../conf/spark_lgbm_dev.yaml')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "ae52df22-2df2-4128-9d11-e17017d531e8",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "WARNING: An illegal reflective access operation has occurred\n",
      "WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.12-3.1.2.jar) to constructor java.nio.DirectByteBuffer(long,int)\n",
      "WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform\n",
      "WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations\n",
      "WARNING: All illegal access operations will be denied in a future release\n",
      "https://mmlspark.azureedge.net/maven added as a remote repository with the name: repo-1\n",
      "Ivy Default Cache set to: /home/spark/.ivy2/cache\n",
      "The jars for the packages stored in: /home/spark/.ivy2/jars\n",
      "com.microsoft.azure#synapseml_2.12 added as a dependency\n",
      ":: resolving dependencies :: org.apache.spark#spark-submit-parent-5364f257-84ed-40f5-b9e6-8101e06cb45e;1.0\n",
      "\tconfs: [default]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      ":: loading settings :: url = jar:file:/opt/spark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\tfound com.microsoft.azure#synapseml_2.12;0.9.4 in repo-1\n",
      "\tfound com.microsoft.azure#synapseml-core_2.12;0.9.4 in repo-1\n",
      "\tfound org.scalactic#scalactic_2.12;3.0.5 in central\n",
      "\tfound org.scala-lang#scala-reflect;2.12.4 in central\n",
      "\tfound io.spray#spray-json_2.12;1.3.2 in central\n",
      "\tfound com.jcraft#jsch;0.1.54 in central\n",
      "\tfound org.apache.httpcomponents#httpclient;4.5.6 in central\n",
      "\tfound org.apache.httpcomponents#httpcore;4.4.10 in central\n",
      "\tfound commons-logging#commons-logging;1.2 in central\n",
      "\tfound commons-codec#commons-codec;1.10 in central\n",
      "\tfound org.apache.httpcomponents#httpmime;4.5.6 in central\n",
      "\tfound com.linkedin.isolation-forest#isolation-forest_3.0.0_2.12;1.0.1 in central\n",
      "\tfound com.chuusai#shapeless_2.12;2.3.2 in central\n",
      "\tfound org.typelevel#macro-compat_2.12;1.1.1 in central\n",
      "\tfound org.apache.spark#spark-avro_2.12;3.0.0 in central\n",
      "\tfound org.spark-project.spark#unused;1.0.0 in central\n",
      "\tfound org.testng#testng;6.8.8 in central\n",
      "\tfound org.beanshell#bsh;2.0b4 in central\n",
      "\tfound com.beust#jcommander;1.27 in central\n",
      "\tfound com.microsoft.azure#synapseml-deep-learning_2.12;0.9.4 in repo-1\n",
      "\tfound com.microsoft.azure#synapseml-opencv_2.12;0.9.4 in repo-1\n",
      "\tfound org.openpnp#opencv;3.2.0-1 in central\n",
      "\tfound com.microsoft.cntk#cntk;2.4 in central\n",
      "\tfound com.microsoft.onnxruntime#onnxruntime_gpu;1.8.1 in central\n",
      "\tfound com.microsoft.azure#synapseml-cognitive_2.12;0.9.4 in repo-1\n",
      "\tfound com.microsoft.cognitiveservices.speech#client-sdk;1.14.0 in repo-1\n",
      "\tfound com.microsoft.azure#synapseml-vw_2.12;0.9.4 in repo-1\n",
      "\tfound com.github.vowpalwabbit#vw-jni;8.9.1 in central\n",
      "\tfound com.microsoft.azure#synapseml-lightgbm_2.12;0.9.4 in repo-1\n",
      "\tfound com.microsoft.ml.lightgbm#lightgbmlib;3.2.110 in central\n",
      ":: resolution report :: resolve 400ms :: artifacts dl 12ms\n",
      "\t:: modules in use:\n",
      "\tcom.beust#jcommander;1.27 from central in [default]\n",
      "\tcom.chuusai#shapeless_2.12;2.3.2 from central in [default]\n",
      "\tcom.github.vowpalwabbit#vw-jni;8.9.1 from central in [default]\n",
      "\tcom.jcraft#jsch;0.1.54 from central in [default]\n",
      "\tcom.linkedin.isolation-forest#isolation-forest_3.0.0_2.12;1.0.1 from central in [default]\n",
      "\tcom.microsoft.azure#synapseml-cognitive_2.12;0.9.4 from repo-1 in [default]\n",
      "\tcom.microsoft.azure#synapseml-core_2.12;0.9.4 from repo-1 in [default]\n",
      "\tcom.microsoft.azure#synapseml-deep-learning_2.12;0.9.4 from repo-1 in [default]\n",
      "\tcom.microsoft.azure#synapseml-lightgbm_2.12;0.9.4 from repo-1 in [default]\n",
      "\tcom.microsoft.azure#synapseml-opencv_2.12;0.9.4 from repo-1 in [default]\n",
      "\tcom.microsoft.azure#synapseml-vw_2.12;0.9.4 from repo-1 in [default]\n",
      "\tcom.microsoft.azure#synapseml_2.12;0.9.4 from repo-1 in [default]\n",
      "\tcom.microsoft.cntk#cntk;2.4 from central in [default]\n",
      "\tcom.microsoft.cognitiveservices.speech#client-sdk;1.14.0 from repo-1 in [default]\n",
      "\tcom.microsoft.ml.lightgbm#lightgbmlib;3.2.110 from central in [default]\n",
      "\tcom.microsoft.onnxruntime#onnxruntime_gpu;1.8.1 from central in [default]\n",
      "\tcommons-codec#commons-codec;1.10 from central in [default]\n",
      "\tcommons-logging#commons-logging;1.2 from central in [default]\n",
      "\tio.spray#spray-json_2.12;1.3.2 from central in [default]\n",
      "\torg.apache.httpcomponents#httpclient;4.5.6 from central in [default]\n",
      "\torg.apache.httpcomponents#httpcore;4.4.10 from central in [default]\n",
      "\torg.apache.httpcomponents#httpmime;4.5.6 from central in [default]\n",
      "\torg.apache.spark#spark-avro_2.12;3.0.0 from central in [default]\n",
      "\torg.beanshell#bsh;2.0b4 from central in [default]\n",
      "\torg.openpnp#opencv;3.2.0-1 from central in [default]\n",
      "\torg.scala-lang#scala-reflect;2.12.4 from central in [default]\n",
      "\torg.scalactic#scalactic_2.12;3.0.5 from central in [default]\n",
      "\torg.spark-project.spark#unused;1.0.0 from central in [default]\n",
      "\torg.testng#testng;6.8.8 from central in [default]\n",
      "\torg.typelevel#macro-compat_2.12;1.1.1 from central in [default]\n",
      "\t---------------------------------------------------------------------\n",
      "\t|                  |            modules            ||   artifacts   |\n",
      "\t|       conf       | number| search|dwnlded|evicted|| number|dwnlded|\n",
      "\t---------------------------------------------------------------------\n",
      "\t|      default     |   30  |   0   |   0   |   0   ||   30  |   0   |\n",
      "\t---------------------------------------------------------------------\n",
      ":: retrieving :: org.apache.spark#spark-submit-parent-5364f257-84ed-40f5-b9e6-8101e06cb45e\n",
      "\tconfs: [default]\n",
      "\t0 artifacts copied, 30 already retrieved (0kB/10ms)\n",
      "22/05/31 11:56:45 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n",
      "Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties\n",
      "Setting default log level to \"WARN\".\n",
      "To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "3.1.2\n",
      "spark-application-1653998207409\n",
      "http://jupyter.my.nginx.test/hub/user-redirect/proxy/4040/jobs/\n"
     ]
    }
   ],
   "source": [
    "spark = init_spark()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "4b7e7e97-a41f-4243-9e59-e30bba25ea94",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                                \r"
     ]
    }
   ],
   "source": [
    "data_path = params['fg_train_dataset_path']\n",
    "fg_train_dataset = read_dataset(spark, data_path)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "7d34e6e1-d17c-4642-bca4-40bbd31af505",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "22/05/31 11:57:03 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>loanAmnt</th>\n",
       "      <th>term</th>\n",
       "      <th>interestRate</th>\n",
       "      <th>installment</th>\n",
       "      <th>grade</th>\n",
       "      <th>subGrade</th>\n",
       "      <th>employmentTitle</th>\n",
       "      <th>employmentLength</th>\n",
       "      <th>homeOwnership</th>\n",
       "      <th>annualIncome</th>\n",
       "      <th>verificationStatus</th>\n",
       "      <th>isDefault</th>\n",
       "      <th>purpose</th>\n",
       "      <th>postCode</th>\n",
       "      <th>regionCode</th>\n",
       "      <th>dti</th>\n",
       "      <th>delinquency_2years</th>\n",
       "      <th>ficoRangeLow</th>\n",
       "      <th>ficoRangeHigh</th>\n",
       "      <th>openAcc</th>\n",
       "      <th>pubRec</th>\n",
       "      <th>pubRecBankruptcies</th>\n",
       "      <th>revolBal</th>\n",
       "      <th>revolUtil</th>\n",
       "      <th>totalAcc</th>\n",
       "      <th>initialListStatus</th>\n",
       "      <th>applicationType</th>\n",
       "      <th>earliesCreditLine</th>\n",
       "      <th>title</th>\n",
       "      <th>policyCode</th>\n",
       "      <th>n0</th>\n",
       "      <th>n1</th>\n",
       "      <th>n2</th>\n",
       "      <th>n3</th>\n",
       "      <th>n4</th>\n",
       "      <th>n5</th>\n",
       "      <th>n6</th>\n",
       "      <th>n7</th>\n",
       "      <th>n8</th>\n",
       "      <th>n9</th>\n",
       "      <th>n10</th>\n",
       "      <th>n11</th>\n",
       "      <th>n12</th>\n",
       "      <th>n13</th>\n",
       "      <th>n14</th>\n",
       "      <th>issueDateDT</th>\n",
       "      <th>grade_target_mean</th>\n",
       "      <th>subGrade_target_mean</th>\n",
       "      <th>grade_to_mean_n0</th>\n",
       "      <th>grade_to_std_n0</th>\n",
       "      <th>grade_to_mean_n1</th>\n",
       "      <th>grade_to_std_n1</th>\n",
       "      <th>grade_to_mean_n2</th>\n",
       "      <th>grade_to_std_n2</th>\n",
       "      <th>grade_to_mean_n4</th>\n",
       "      <th>grade_to_std_n4</th>\n",
       "      <th>grade_to_mean_n5</th>\n",
       "      <th>grade_to_std_n5</th>\n",
       "      <th>grade_to_mean_n6</th>\n",
       "      <th>grade_to_std_n6</th>\n",
       "      <th>grade_to_mean_n7</th>\n",
       "      <th>grade_to_std_n7</th>\n",
       "      <th>grade_to_mean_n8</th>\n",
       "      <th>grade_to_std_n8</th>\n",
       "      <th>grade_to_mean_n9</th>\n",
       "      <th>grade_to_std_n9</th>\n",
       "      <th>grade_to_mean_n10</th>\n",
       "      <th>grade_to_std_n10</th>\n",
       "      <th>grade_to_mean_n11</th>\n",
       "      <th>grade_to_std_n11</th>\n",
       "      <th>grade_to_mean_n12</th>\n",
       "      <th>grade_to_std_n12</th>\n",
       "      <th>grade_to_mean_n13</th>\n",
       "      <th>grade_to_std_n13</th>\n",
       "      <th>grade_to_mean_n14</th>\n",
       "      <th>grade_to_std_n14</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>35000.0</td>\n",
       "      <td>5</td>\n",
       "      <td>19.52</td>\n",
       "      <td>917.97</td>\n",
       "      <td>5</td>\n",
       "      <td>21</td>\n",
       "      <td>161280</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>110000.0</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>43</td>\n",
       "      <td>32</td>\n",
       "      <td>17.05</td>\n",
       "      <td>0.0</td>\n",
       "      <td>730.0</td>\n",
       "      <td>734.0</td>\n",
       "      <td>7.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>24178.0</td>\n",
       "      <td>48.9</td>\n",
       "      <td>27.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2001</td>\n",
       "      <td>1</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>9.0</td>\n",
       "      <td>8.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>12.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>7.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>2587</td>\n",
       "      <td>0.386234</td>\n",
       "      <td>0.380444</td>\n",
       "      <td>1.876011</td>\n",
       "      <td>3.992386</td>\n",
       "      <td>1.874620</td>\n",
       "      <td>4.053876</td>\n",
       "      <td>1.942294</td>\n",
       "      <td>4.023418</td>\n",
       "      <td>1.869160</td>\n",
       "      <td>3.948124</td>\n",
       "      <td>1.897562</td>\n",
       "      <td>4.055665</td>\n",
       "      <td>1.865760</td>\n",
       "      <td>4.017884</td>\n",
       "      <td>1.840872</td>\n",
       "      <td>4.074681</td>\n",
       "      <td>1.851544</td>\n",
       "      <td>4.040923</td>\n",
       "      <td>1.938318</td>\n",
       "      <td>4.024912</td>\n",
       "      <td>1.842210</td>\n",
       "      <td>4.108917</td>\n",
       "      <td>1.852810</td>\n",
       "      <td>4.009823</td>\n",
       "      <td>1.852810</td>\n",
       "      <td>4.009823</td>\n",
       "      <td>1.857394</td>\n",
       "      <td>4.005352</td>\n",
       "      <td>1.856379</td>\n",
       "      <td>3.991791</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>18000.0</td>\n",
       "      <td>5</td>\n",
       "      <td>18.49</td>\n",
       "      <td>461.90</td>\n",
       "      <td>4</td>\n",
       "      <td>16</td>\n",
       "      <td>89538</td>\n",
       "      <td>5</td>\n",
       "      <td>0</td>\n",
       "      <td>46000.0</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>64</td>\n",
       "      <td>18</td>\n",
       "      <td>27.83</td>\n",
       "      <td>0.0</td>\n",
       "      <td>700.0</td>\n",
       "      <td>704.0</td>\n",
       "      <td>13.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>15096.0</td>\n",
       "      <td>38.9</td>\n",
       "      <td>18.0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>2002</td>\n",
       "      <td>5768</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>10.0</td>\n",
       "      <td>7.0</td>\n",
       "      <td>7.0</td>\n",
       "      <td>7.0</td>\n",
       "      <td>13.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>13.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>1888</td>\n",
       "      <td>0.304227</td>\n",
       "      <td>0.298190</td>\n",
       "      <td>1.500809</td>\n",
       "      <td>3.193909</td>\n",
       "      <td>1.502905</td>\n",
       "      <td>3.185919</td>\n",
       "      <td>1.504054</td>\n",
       "      <td>3.173189</td>\n",
       "      <td>1.567352</td>\n",
       "      <td>3.204484</td>\n",
       "      <td>1.511316</td>\n",
       "      <td>3.139166</td>\n",
       "      <td>1.515599</td>\n",
       "      <td>3.098975</td>\n",
       "      <td>1.500817</td>\n",
       "      <td>3.139721</td>\n",
       "      <td>1.517874</td>\n",
       "      <td>3.086106</td>\n",
       "      <td>1.504140</td>\n",
       "      <td>3.174194</td>\n",
       "      <td>1.484104</td>\n",
       "      <td>3.173687</td>\n",
       "      <td>1.482248</td>\n",
       "      <td>3.207858</td>\n",
       "      <td>1.482248</td>\n",
       "      <td>3.207858</td>\n",
       "      <td>1.485915</td>\n",
       "      <td>3.204282</td>\n",
       "      <td>1.485103</td>\n",
       "      <td>3.193433</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>12000.0</td>\n",
       "      <td>5</td>\n",
       "      <td>16.99</td>\n",
       "      <td>298.17</td>\n",
       "      <td>4</td>\n",
       "      <td>17</td>\n",
       "      <td>159367</td>\n",
       "      <td>8</td>\n",
       "      <td>0</td>\n",
       "      <td>74000.0</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>265</td>\n",
       "      <td>14</td>\n",
       "      <td>22.77</td>\n",
       "      <td>0.0</td>\n",
       "      <td>675.0</td>\n",
       "      <td>679.0</td>\n",
       "      <td>11.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>4606.0</td>\n",
       "      <td>51.8</td>\n",
       "      <td>27.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2006</td>\n",
       "      <td>0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>21.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>11.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>3044</td>\n",
       "      <td>0.304227</td>\n",
       "      <td>0.302541</td>\n",
       "      <td>1.500809</td>\n",
       "      <td>3.193909</td>\n",
       "      <td>1.360761</td>\n",
       "      <td>2.998190</td>\n",
       "      <td>1.532981</td>\n",
       "      <td>3.241462</td>\n",
       "      <td>1.273891</td>\n",
       "      <td>3.071276</td>\n",
       "      <td>1.162371</td>\n",
       "      <td>3.176718</td>\n",
       "      <td>1.480241</td>\n",
       "      <td>3.125317</td>\n",
       "      <td>1.472698</td>\n",
       "      <td>3.259745</td>\n",
       "      <td>1.406712</td>\n",
       "      <td>3.254085</td>\n",
       "      <td>1.530998</td>\n",
       "      <td>3.244609</td>\n",
       "      <td>1.504230</td>\n",
       "      <td>3.089208</td>\n",
       "      <td>1.482248</td>\n",
       "      <td>3.207858</td>\n",
       "      <td>1.482248</td>\n",
       "      <td>3.207858</td>\n",
       "      <td>1.485915</td>\n",
       "      <td>3.204282</td>\n",
       "      <td>1.315111</td>\n",
       "      <td>3.146801</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2050.0</td>\n",
       "      <td>3</td>\n",
       "      <td>7.69</td>\n",
       "      <td>63.95</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>59830</td>\n",
       "      <td>9</td>\n",
       "      <td>0</td>\n",
       "      <td>35000.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>465</td>\n",
       "      <td>14</td>\n",
       "      <td>17.49</td>\n",
       "      <td>0.0</td>\n",
       "      <td>755.0</td>\n",
       "      <td>759.0</td>\n",
       "      <td>12.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>3111.0</td>\n",
       "      <td>8.5</td>\n",
       "      <td>23.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2006</td>\n",
       "      <td>0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>7.0</td>\n",
       "      <td>11.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>10.0</td>\n",
       "      <td>18.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>12.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>2679</td>\n",
       "      <td>0.059838</td>\n",
       "      <td>0.065532</td>\n",
       "      <td>0.375202</td>\n",
       "      <td>0.798477</td>\n",
       "      <td>0.368239</td>\n",
       "      <td>0.796491</td>\n",
       "      <td>0.383245</td>\n",
       "      <td>0.810366</td>\n",
       "      <td>0.380622</td>\n",
       "      <td>0.806605</td>\n",
       "      <td>0.384972</td>\n",
       "      <td>0.802575</td>\n",
       "      <td>0.368526</td>\n",
       "      <td>0.819126</td>\n",
       "      <td>0.369865</td>\n",
       "      <td>0.798404</td>\n",
       "      <td>0.377964</td>\n",
       "      <td>0.799464</td>\n",
       "      <td>0.382750</td>\n",
       "      <td>0.811152</td>\n",
       "      <td>0.370128</td>\n",
       "      <td>0.799459</td>\n",
       "      <td>0.370562</td>\n",
       "      <td>0.801965</td>\n",
       "      <td>0.370562</td>\n",
       "      <td>0.801965</td>\n",
       "      <td>0.371479</td>\n",
       "      <td>0.801070</td>\n",
       "      <td>0.344287</td>\n",
       "      <td>0.793451</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>11500.0</td>\n",
       "      <td>3</td>\n",
       "      <td>14.98</td>\n",
       "      <td>398.54</td>\n",
       "      <td>3</td>\n",
       "      <td>12</td>\n",
       "      <td>85242</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>30000.0</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>4</td>\n",
       "      <td>32.60</td>\n",
       "      <td>0.0</td>\n",
       "      <td>665.0</td>\n",
       "      <td>669.0</td>\n",
       "      <td>8.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>14021.0</td>\n",
       "      <td>59.7</td>\n",
       "      <td>33.0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1994</td>\n",
       "      <td>0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>16.0</td>\n",
       "      <td>10.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>21.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>8.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>2406</td>\n",
       "      <td>0.224522</td>\n",
       "      <td>0.224686</td>\n",
       "      <td>1.125607</td>\n",
       "      <td>2.395431</td>\n",
       "      <td>1.113406</td>\n",
       "      <td>2.430896</td>\n",
       "      <td>1.133984</td>\n",
       "      <td>2.439745</td>\n",
       "      <td>1.121496</td>\n",
       "      <td>2.368874</td>\n",
       "      <td>1.197930</td>\n",
       "      <td>2.401168</td>\n",
       "      <td>1.120956</td>\n",
       "      <td>2.388727</td>\n",
       "      <td>1.106851</td>\n",
       "      <td>2.450979</td>\n",
       "      <td>1.144817</td>\n",
       "      <td>2.403154</td>\n",
       "      <td>1.133458</td>\n",
       "      <td>2.441340</td>\n",
       "      <td>1.104961</td>\n",
       "      <td>2.446307</td>\n",
       "      <td>1.111686</td>\n",
       "      <td>2.405894</td>\n",
       "      <td>1.111686</td>\n",
       "      <td>2.405894</td>\n",
       "      <td>1.114436</td>\n",
       "      <td>2.403211</td>\n",
       "      <td>1.113827</td>\n",
       "      <td>2.395075</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>12000.0</td>\n",
       "      <td>3</td>\n",
       "      <td>12.99</td>\n",
       "      <td>404.27</td>\n",
       "      <td>3</td>\n",
       "      <td>11</td>\n",
       "      <td>65718</td>\n",
       "      <td>5</td>\n",
       "      <td>2</td>\n",
       "      <td>60000.0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>770</td>\n",
       "      <td>13</td>\n",
       "      <td>19.22</td>\n",
       "      <td>0.0</td>\n",
       "      <td>690.0</td>\n",
       "      <td>694.0</td>\n",
       "      <td>15.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>27176.0</td>\n",
       "      <td>46.0</td>\n",
       "      <td>21.0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1994</td>\n",
       "      <td>0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>7.0</td>\n",
       "      <td>13.0</td>\n",
       "      <td>13.0</td>\n",
       "      <td>7.0</td>\n",
       "      <td>7.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>13.0</td>\n",
       "      <td>17.0</td>\n",
       "      <td>11.0</td>\n",
       "      <td>15.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>6.0</td>\n",
       "      <td>3257</td>\n",
       "      <td>0.224522</td>\n",
       "      <td>0.204005</td>\n",
       "      <td>1.125607</td>\n",
       "      <td>2.395431</td>\n",
       "      <td>1.085997</td>\n",
       "      <td>2.408741</td>\n",
       "      <td>0.984707</td>\n",
       "      <td>2.361605</td>\n",
       "      <td>1.141867</td>\n",
       "      <td>2.419815</td>\n",
       "      <td>1.133487</td>\n",
       "      <td>2.354374</td>\n",
       "      <td>1.100101</td>\n",
       "      <td>2.459716</td>\n",
       "      <td>1.119411</td>\n",
       "      <td>2.396658</td>\n",
       "      <td>1.136053</td>\n",
       "      <td>2.409156</td>\n",
       "      <td>1.011351</td>\n",
       "      <td>2.376224</td>\n",
       "      <td>1.124941</td>\n",
       "      <td>2.384061</td>\n",
       "      <td>1.111686</td>\n",
       "      <td>2.405894</td>\n",
       "      <td>1.111686</td>\n",
       "      <td>2.405894</td>\n",
       "      <td>1.114436</td>\n",
       "      <td>2.403211</td>\n",
       "      <td>0.923430</td>\n",
       "      <td>2.361914</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>24000.0</td>\n",
       "      <td>3</td>\n",
       "      <td>9.99</td>\n",
       "      <td>774.30</td>\n",
       "      <td>2</td>\n",
       "      <td>7</td>\n",
       "      <td>209276</td>\n",
       "      <td>10</td>\n",
       "      <td>0</td>\n",
       "      <td>150000.0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>40</td>\n",
       "      <td>8</td>\n",
       "      <td>5.68</td>\n",
       "      <td>0.0</td>\n",
       "      <td>690.0</td>\n",
       "      <td>694.0</td>\n",
       "      <td>7.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>4334.0</td>\n",
       "      <td>68.8</td>\n",
       "      <td>25.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1983</td>\n",
       "      <td>18780</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>7.0</td>\n",
       "      <td>7.0</td>\n",
       "      <td>6.0</td>\n",
       "      <td>17.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>7.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>2983</td>\n",
       "      <td>0.131210</td>\n",
       "      <td>0.128111</td>\n",
       "      <td>0.707941</td>\n",
       "      <td>1.635584</td>\n",
       "      <td>0.736477</td>\n",
       "      <td>1.592982</td>\n",
       "      <td>0.766491</td>\n",
       "      <td>1.620731</td>\n",
       "      <td>0.720818</td>\n",
       "      <td>1.621383</td>\n",
       "      <td>0.755658</td>\n",
       "      <td>1.569583</td>\n",
       "      <td>0.757800</td>\n",
       "      <td>1.549487</td>\n",
       "      <td>0.738697</td>\n",
       "      <td>1.625010</td>\n",
       "      <td>0.757368</td>\n",
       "      <td>1.606104</td>\n",
       "      <td>0.765499</td>\n",
       "      <td>1.622304</td>\n",
       "      <td>0.736884</td>\n",
       "      <td>1.643567</td>\n",
       "      <td>0.741124</td>\n",
       "      <td>1.603929</td>\n",
       "      <td>0.741124</td>\n",
       "      <td>1.603929</td>\n",
       "      <td>0.742958</td>\n",
       "      <td>1.602141</td>\n",
       "      <td>0.742552</td>\n",
       "      <td>1.596716</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>16000.0</td>\n",
       "      <td>3</td>\n",
       "      <td>7.91</td>\n",
       "      <td>500.72</td>\n",
       "      <td>1</td>\n",
       "      <td>4</td>\n",
       "      <td>8198</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>50000.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>4</td>\n",
       "      <td>76</td>\n",
       "      <td>8</td>\n",
       "      <td>38.95</td>\n",
       "      <td>0.0</td>\n",
       "      <td>710.0</td>\n",
       "      <td>714.0</td>\n",
       "      <td>9.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>19023.0</td>\n",
       "      <td>60.8</td>\n",
       "      <td>11.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2011</td>\n",
       "      <td>16334</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>6.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>7.0</td>\n",
       "      <td>9.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>9.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>3136</td>\n",
       "      <td>0.059838</td>\n",
       "      <td>0.083522</td>\n",
       "      <td>0.375202</td>\n",
       "      <td>0.798477</td>\n",
       "      <td>0.371135</td>\n",
       "      <td>0.810299</td>\n",
       "      <td>0.376013</td>\n",
       "      <td>0.793297</td>\n",
       "      <td>0.373832</td>\n",
       "      <td>0.789625</td>\n",
       "      <td>0.368325</td>\n",
       "      <td>0.815212</td>\n",
       "      <td>0.366700</td>\n",
       "      <td>0.819905</td>\n",
       "      <td>0.375204</td>\n",
       "      <td>0.784930</td>\n",
       "      <td>0.364666</td>\n",
       "      <td>0.813245</td>\n",
       "      <td>0.376035</td>\n",
       "      <td>0.793549</td>\n",
       "      <td>0.368003</td>\n",
       "      <td>0.809138</td>\n",
       "      <td>0.370562</td>\n",
       "      <td>0.801965</td>\n",
       "      <td>0.370562</td>\n",
       "      <td>0.801965</td>\n",
       "      <td>0.371479</td>\n",
       "      <td>0.801070</td>\n",
       "      <td>0.395135</td>\n",
       "      <td>0.846111</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>6000.0</td>\n",
       "      <td>3</td>\n",
       "      <td>10.49</td>\n",
       "      <td>194.99</td>\n",
       "      <td>2</td>\n",
       "      <td>6</td>\n",
       "      <td>115263</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>77000.0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>106</td>\n",
       "      <td>38</td>\n",
       "      <td>17.27</td>\n",
       "      <td>0.0</td>\n",
       "      <td>660.0</td>\n",
       "      <td>664.0</td>\n",
       "      <td>16.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>220.0</td>\n",
       "      <td>3.6</td>\n",
       "      <td>49.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1996</td>\n",
       "      <td>18780</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>11.0</td>\n",
       "      <td>14.0</td>\n",
       "      <td>13.0</td>\n",
       "      <td>32.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>15.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>3533</td>\n",
       "      <td>0.131210</td>\n",
       "      <td>0.109461</td>\n",
       "      <td>0.750404</td>\n",
       "      <td>1.596954</td>\n",
       "      <td>0.736477</td>\n",
       "      <td>1.592982</td>\n",
       "      <td>0.755989</td>\n",
       "      <td>1.626497</td>\n",
       "      <td>0.720818</td>\n",
       "      <td>1.621383</td>\n",
       "      <td>0.769944</td>\n",
       "      <td>1.605151</td>\n",
       "      <td>0.739618</td>\n",
       "      <td>1.580526</td>\n",
       "      <td>0.746274</td>\n",
       "      <td>1.597772</td>\n",
       "      <td>0.788374</td>\n",
       "      <td>1.610142</td>\n",
       "      <td>0.755638</td>\n",
       "      <td>1.627560</td>\n",
       "      <td>0.749961</td>\n",
       "      <td>1.589374</td>\n",
       "      <td>0.741124</td>\n",
       "      <td>1.603929</td>\n",
       "      <td>0.741124</td>\n",
       "      <td>1.603929</td>\n",
       "      <td>0.742958</td>\n",
       "      <td>1.602141</td>\n",
       "      <td>0.846155</td>\n",
       "      <td>1.753293</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>10375.0</td>\n",
       "      <td>5</td>\n",
       "      <td>15.61</td>\n",
       "      <td>250.16</td>\n",
       "      <td>4</td>\n",
       "      <td>15</td>\n",
       "      <td>74728</td>\n",
       "      <td>9</td>\n",
       "      <td>0</td>\n",
       "      <td>58000.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>437</td>\n",
       "      <td>36</td>\n",
       "      <td>21.02</td>\n",
       "      <td>0.0</td>\n",
       "      <td>705.0</td>\n",
       "      <td>709.0</td>\n",
       "      <td>16.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>36609.0</td>\n",
       "      <td>61.1</td>\n",
       "      <td>33.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2002</td>\n",
       "      <td>18780</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>6.0</td>\n",
       "      <td>14.0</td>\n",
       "      <td>13.0</td>\n",
       "      <td>14.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>16.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>2526</td>\n",
       "      <td>0.304227</td>\n",
       "      <td>0.279444</td>\n",
       "      <td>1.500809</td>\n",
       "      <td>3.193909</td>\n",
       "      <td>1.502905</td>\n",
       "      <td>3.185919</td>\n",
       "      <td>1.511979</td>\n",
       "      <td>3.252993</td>\n",
       "      <td>1.494754</td>\n",
       "      <td>3.218213</td>\n",
       "      <td>1.473298</td>\n",
       "      <td>3.260850</td>\n",
       "      <td>1.479236</td>\n",
       "      <td>3.161051</td>\n",
       "      <td>1.492548</td>\n",
       "      <td>3.195544</td>\n",
       "      <td>1.497336</td>\n",
       "      <td>3.234727</td>\n",
       "      <td>1.511277</td>\n",
       "      <td>3.255120</td>\n",
       "      <td>1.496655</td>\n",
       "      <td>3.146687</td>\n",
       "      <td>1.482248</td>\n",
       "      <td>3.207858</td>\n",
       "      <td>1.482248</td>\n",
       "      <td>3.207858</td>\n",
       "      <td>1.485915</td>\n",
       "      <td>3.204282</td>\n",
       "      <td>1.485103</td>\n",
       "      <td>3.193433</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   loanAmnt  term  interestRate  installment  grade  subGrade  \\\n",
       "0   35000.0     5         19.52       917.97      5        21   \n",
       "1   18000.0     5         18.49       461.90      4        16   \n",
       "2   12000.0     5         16.99       298.17      4        17   \n",
       "3    2050.0     3          7.69        63.95      1         3   \n",
       "4   11500.0     3         14.98       398.54      3        12   \n",
       "5   12000.0     3         12.99       404.27      3        11   \n",
       "6   24000.0     3          9.99       774.30      2         7   \n",
       "7   16000.0     3          7.91       500.72      1         4   \n",
       "8    6000.0     3         10.49       194.99      2         6   \n",
       "9   10375.0     5         15.61       250.16      4        15   \n",
       "\n",
       "   employmentTitle  employmentLength  homeOwnership  annualIncome  \\\n",
       "0           161280                 2              2      110000.0   \n",
       "1            89538                 5              0       46000.0   \n",
       "2           159367                 8              0       74000.0   \n",
       "3            59830                 9              0       35000.0   \n",
       "4            85242                 1              1       30000.0   \n",
       "5            65718                 5              2       60000.0   \n",
       "6           209276                10              0      150000.0   \n",
       "7             8198                 2              1       50000.0   \n",
       "8           115263                 2              0       77000.0   \n",
       "9            74728                 9              0       58000.0   \n",
       "\n",
       "   verificationStatus  isDefault  purpose  postCode  regionCode    dti  \\\n",
       "0                   2          1        1        43          32  17.05   \n",
       "1                   2          0        0        64          18  27.83   \n",
       "2                   2          0        0       265          14  22.77   \n",
       "3                   0          0        0       465          14  17.49   \n",
       "4                   2          0        0         3           4  32.60   \n",
       "5                   1          1        0       770          13  19.22   \n",
       "6                   1          0        2        40           8   5.68   \n",
       "7                   0          0        4        76           8  38.95   \n",
       "8                   1          0        2       106          38  17.27   \n",
       "9                   0          0        2       437          36  21.02   \n",
       "\n",
       "   delinquency_2years  ficoRangeLow  ficoRangeHigh  openAcc  pubRec  \\\n",
       "0                 0.0         730.0          734.0      7.0     0.0   \n",
       "1                 0.0         700.0          704.0     13.0     0.0   \n",
       "2                 0.0         675.0          679.0     11.0     0.0   \n",
       "3                 0.0         755.0          759.0     12.0     0.0   \n",
       "4                 0.0         665.0          669.0      8.0     1.0   \n",
       "5                 0.0         690.0          694.0     15.0     0.0   \n",
       "6                 0.0         690.0          694.0      7.0     0.0   \n",
       "7                 0.0         710.0          714.0      9.0     0.0   \n",
       "8                 0.0         660.0          664.0     16.0     1.0   \n",
       "9                 0.0         705.0          709.0     16.0     0.0   \n",
       "\n",
       "   pubRecBankruptcies  revolBal  revolUtil  totalAcc  initialListStatus  \\\n",
       "0                 0.0   24178.0       48.9      27.0                  0   \n",
       "1                 0.0   15096.0       38.9      18.0                  1   \n",
       "2                 0.0    4606.0       51.8      27.0                  0   \n",
       "3                 0.0    3111.0        8.5      23.0                  0   \n",
       "4                 1.0   14021.0       59.7      33.0                  1   \n",
       "5                 0.0   27176.0       46.0      21.0                  1   \n",
       "6                 0.0    4334.0       68.8      25.0                  0   \n",
       "7                 0.0   19023.0       60.8      11.0                  0   \n",
       "8                 1.0     220.0        3.6      49.0                  0   \n",
       "9                 0.0   36609.0       61.1      33.0                  0   \n",
       "\n",
       "   applicationType  earliesCreditLine  title  policyCode   n0   n1    n2  \\\n",
       "0                0               2001      1         1.0  0.0  2.0   2.0   \n",
       "1                0               2002   5768         1.0  0.0  3.0   5.0   \n",
       "2                0               2006      0         1.0  0.0  0.0   3.0   \n",
       "3                0               2006      0         1.0  0.0  1.0   3.0   \n",
       "4                0               1994      0         1.0  0.0  4.0   4.0   \n",
       "5                0               1994      0         1.0  0.0  7.0  13.0   \n",
       "6                0               1983  18780         1.0  1.0  1.0   3.0   \n",
       "7                0               2011  16334         1.0  0.0  4.0   5.0   \n",
       "8                0               1996  18780         1.0  0.0  1.0   4.0   \n",
       "9                0               2002  18780         1.0  0.0  3.0   4.0   \n",
       "\n",
       "     n3    n4    n5    n6    n7    n8    n9   n10  n11  n12  n13  n14  \\\n",
       "0   2.0   4.0   9.0   8.0   4.0  12.0   2.0   7.0  0.0  0.0  0.0  2.0   \n",
       "1   5.0  10.0   7.0   7.0   7.0  13.0   5.0  13.0  0.0  0.0  0.0  2.0   \n",
       "2   3.0   0.0   0.0  21.0   4.0   5.0   3.0  11.0  0.0  0.0  0.0  4.0   \n",
       "3   3.0   7.0  11.0   3.0  10.0  18.0   3.0  12.0  0.0  0.0  0.0  3.0   \n",
       "4   4.0   4.0  16.0  10.0   5.0  21.0   4.0   8.0  0.0  0.0  0.0  2.0   \n",
       "5  13.0   7.0   7.0   2.0  13.0  17.0  11.0  15.0  0.0  0.0  0.0  6.0   \n",
       "6   3.0   2.0   7.0   7.0   6.0  17.0   3.0   7.0  0.0  0.0  0.0  2.0   \n",
       "7   5.0   4.0   6.0   2.0   7.0   9.0   5.0   9.0  0.0  0.0  0.0  1.0   \n",
       "8   4.0   2.0  11.0  14.0  13.0  32.0   4.0  15.0  0.0  0.0  0.0  0.0   \n",
       "9   4.0   5.0   6.0  14.0  13.0  14.0   4.0  16.0  0.0  0.0  0.0  2.0   \n",
       "\n",
       "   issueDateDT  grade_target_mean  subGrade_target_mean  grade_to_mean_n0  \\\n",
       "0         2587           0.386234              0.380444          1.876011   \n",
       "1         1888           0.304227              0.298190          1.500809   \n",
       "2         3044           0.304227              0.302541          1.500809   \n",
       "3         2679           0.059838              0.065532          0.375202   \n",
       "4         2406           0.224522              0.224686          1.125607   \n",
       "5         3257           0.224522              0.204005          1.125607   \n",
       "6         2983           0.131210              0.128111          0.707941   \n",
       "7         3136           0.059838              0.083522          0.375202   \n",
       "8         3533           0.131210              0.109461          0.750404   \n",
       "9         2526           0.304227              0.279444          1.500809   \n",
       "\n",
       "   grade_to_std_n0  grade_to_mean_n1  grade_to_std_n1  grade_to_mean_n2  \\\n",
       "0         3.992386          1.874620         4.053876          1.942294   \n",
       "1         3.193909          1.502905         3.185919          1.504054   \n",
       "2         3.193909          1.360761         2.998190          1.532981   \n",
       "3         0.798477          0.368239         0.796491          0.383245   \n",
       "4         2.395431          1.113406         2.430896          1.133984   \n",
       "5         2.395431          1.085997         2.408741          0.984707   \n",
       "6         1.635584          0.736477         1.592982          0.766491   \n",
       "7         0.798477          0.371135         0.810299          0.376013   \n",
       "8         1.596954          0.736477         1.592982          0.755989   \n",
       "9         3.193909          1.502905         3.185919          1.511979   \n",
       "\n",
       "   grade_to_std_n2  grade_to_mean_n4  grade_to_std_n4  grade_to_mean_n5  \\\n",
       "0         4.023418          1.869160         3.948124          1.897562   \n",
       "1         3.173189          1.567352         3.204484          1.511316   \n",
       "2         3.241462          1.273891         3.071276          1.162371   \n",
       "3         0.810366          0.380622         0.806605          0.384972   \n",
       "4         2.439745          1.121496         2.368874          1.197930   \n",
       "5         2.361605          1.141867         2.419815          1.133487   \n",
       "6         1.620731          0.720818         1.621383          0.755658   \n",
       "7         0.793297          0.373832         0.789625          0.368325   \n",
       "8         1.626497          0.720818         1.621383          0.769944   \n",
       "9         3.252993          1.494754         3.218213          1.473298   \n",
       "\n",
       "   grade_to_std_n5  grade_to_mean_n6  grade_to_std_n6  grade_to_mean_n7  \\\n",
       "0         4.055665          1.865760         4.017884          1.840872   \n",
       "1         3.139166          1.515599         3.098975          1.500817   \n",
       "2         3.176718          1.480241         3.125317          1.472698   \n",
       "3         0.802575          0.368526         0.819126          0.369865   \n",
       "4         2.401168          1.120956         2.388727          1.106851   \n",
       "5         2.354374          1.100101         2.459716          1.119411   \n",
       "6         1.569583          0.757800         1.549487          0.738697   \n",
       "7         0.815212          0.366700         0.819905          0.375204   \n",
       "8         1.605151          0.739618         1.580526          0.746274   \n",
       "9         3.260850          1.479236         3.161051          1.492548   \n",
       "\n",
       "   grade_to_std_n7  grade_to_mean_n8  grade_to_std_n8  grade_to_mean_n9  \\\n",
       "0         4.074681          1.851544         4.040923          1.938318   \n",
       "1         3.139721          1.517874         3.086106          1.504140   \n",
       "2         3.259745          1.406712         3.254085          1.530998   \n",
       "3         0.798404          0.377964         0.799464          0.382750   \n",
       "4         2.450979          1.144817         2.403154          1.133458   \n",
       "5         2.396658          1.136053         2.409156          1.011351   \n",
       "6         1.625010          0.757368         1.606104          0.765499   \n",
       "7         0.784930          0.364666         0.813245          0.376035   \n",
       "8         1.597772          0.788374         1.610142          0.755638   \n",
       "9         3.195544          1.497336         3.234727          1.511277   \n",
       "\n",
       "   grade_to_std_n9  grade_to_mean_n10  grade_to_std_n10  grade_to_mean_n11  \\\n",
       "0         4.024912           1.842210          4.108917           1.852810   \n",
       "1         3.174194           1.484104          3.173687           1.482248   \n",
       "2         3.244609           1.504230          3.089208           1.482248   \n",
       "3         0.811152           0.370128          0.799459           0.370562   \n",
       "4         2.441340           1.104961          2.446307           1.111686   \n",
       "5         2.376224           1.124941          2.384061           1.111686   \n",
       "6         1.622304           0.736884          1.643567           0.741124   \n",
       "7         0.793549           0.368003          0.809138           0.370562   \n",
       "8         1.627560           0.749961          1.589374           0.741124   \n",
       "9         3.255120           1.496655          3.146687           1.482248   \n",
       "\n",
       "   grade_to_std_n11  grade_to_mean_n12  grade_to_std_n12  grade_to_mean_n13  \\\n",
       "0          4.009823           1.852810          4.009823           1.857394   \n",
       "1          3.207858           1.482248          3.207858           1.485915   \n",
       "2          3.207858           1.482248          3.207858           1.485915   \n",
       "3          0.801965           0.370562          0.801965           0.371479   \n",
       "4          2.405894           1.111686          2.405894           1.114436   \n",
       "5          2.405894           1.111686          2.405894           1.114436   \n",
       "6          1.603929           0.741124          1.603929           0.742958   \n",
       "7          0.801965           0.370562          0.801965           0.371479   \n",
       "8          1.603929           0.741124          1.603929           0.742958   \n",
       "9          3.207858           1.482248          3.207858           1.485915   \n",
       "\n",
       "   grade_to_std_n13  grade_to_mean_n14  grade_to_std_n14  \n",
       "0          4.005352           1.856379          3.991791  \n",
       "1          3.204282           1.485103          3.193433  \n",
       "2          3.204282           1.315111          3.146801  \n",
       "3          0.801070           0.344287          0.793451  \n",
       "4          2.403211           1.113827          2.395075  \n",
       "5          2.403211           0.923430          2.361914  \n",
       "6          1.602141           0.742552          1.596716  \n",
       "7          0.801070           0.395135          0.846111  \n",
       "8          1.602141           0.846155          1.753293  \n",
       "9          3.204282           1.485103          3.193433  "
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "fg_train_dataset.limit(10).toPandas()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8c51501a-9697-4a23-8ff4-d501589c6095",
   "metadata": {},
   "source": [
    "### Label and Features\n",
    "We assume the Spark DataFrame holding the training dataset contains only numerical features. The column named by `params['label']` serves as the label, and all remaining columns are used as features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "3bd8a49d-f687-4378-a9bb-d06fdf164d31",
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_cols = [x for x in fg_train_dataset.columns if x != params['label']]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "134d3e37-6604-4127-af26-3894c361172a",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['loanAmnt',\n",
       " 'term',\n",
       " 'interestRate',\n",
       " 'installment',\n",
       " 'grade',\n",
       " 'subGrade',\n",
       " 'employmentTitle',\n",
       " 'employmentLength',\n",
       " 'homeOwnership',\n",
       " 'annualIncome',\n",
       " 'verificationStatus',\n",
       " 'purpose',\n",
       " 'postCode',\n",
       " 'regionCode',\n",
       " 'dti',\n",
       " 'delinquency_2years',\n",
       " 'ficoRangeLow',\n",
       " 'ficoRangeHigh',\n",
       " 'openAcc',\n",
       " 'pubRec',\n",
       " 'pubRecBankruptcies',\n",
       " 'revolBal',\n",
       " 'revolUtil',\n",
       " 'totalAcc',\n",
       " 'initialListStatus',\n",
       " 'applicationType',\n",
       " 'earliesCreditLine',\n",
       " 'title',\n",
       " 'policyCode',\n",
       " 'n0',\n",
       " 'n1',\n",
       " 'n2',\n",
       " 'n3',\n",
       " 'n4',\n",
       " 'n5',\n",
       " 'n6',\n",
       " 'n7',\n",
       " 'n8',\n",
       " 'n9',\n",
       " 'n10',\n",
       " 'n11',\n",
       " 'n12',\n",
       " 'n13',\n",
       " 'n14',\n",
       " 'issueDateDT',\n",
       " 'grade_target_mean',\n",
       " 'subGrade_target_mean',\n",
       " 'grade_to_mean_n0',\n",
       " 'grade_to_std_n0',\n",
       " 'grade_to_mean_n1',\n",
       " 'grade_to_std_n1',\n",
       " 'grade_to_mean_n2',\n",
       " 'grade_to_std_n2',\n",
       " 'grade_to_mean_n4',\n",
       " 'grade_to_std_n4',\n",
       " 'grade_to_mean_n5',\n",
       " 'grade_to_std_n5',\n",
       " 'grade_to_mean_n6',\n",
       " 'grade_to_std_n6',\n",
       " 'grade_to_mean_n7',\n",
       " 'grade_to_std_n7',\n",
       " 'grade_to_mean_n8',\n",
       " 'grade_to_std_n8',\n",
       " 'grade_to_mean_n9',\n",
       " 'grade_to_std_n9',\n",
       " 'grade_to_mean_n10',\n",
       " 'grade_to_std_n10',\n",
       " 'grade_to_mean_n11',\n",
       " 'grade_to_std_n11',\n",
       " 'grade_to_mean_n12',\n",
       " 'grade_to_std_n12',\n",
       " 'grade_to_mean_n13',\n",
       " 'grade_to_std_n13',\n",
       " 'grade_to_mean_n14',\n",
       " 'grade_to_std_n14']"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "feature_cols"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "33ac3348-606c-4eb8-83d2-23ff89d16015",
   "metadata": {},
   "outputs": [],
   "source": [
    "train_data = get_vectorassembler(fg_train_dataset, label=params['label'], features='features')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "8657dfe9-7b8a-4004-990c-8621ef2968c6",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>isDefault</th>\n",
       "      <th>features</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>[35000.0, 5.0, 19.52, 917.97, 5.0, 21.0, 16128...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "      <td>[18000.0, 5.0, 18.49, 461.9, 4.0, 16.0, 89538....</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "      <td>[12000.0, 5.0, 16.99, 298.17, 4.0, 17.0, 15936...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0</td>\n",
       "      <td>[2050.0, 3.0, 7.69, 63.95, 1.0, 3.0, 59830.0, ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>[11500.0, 3.0, 14.98, 398.54, 3.0, 12.0, 85242...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>1</td>\n",
       "      <td>[12000.0, 3.0, 12.99, 404.27, 3.0, 11.0, 65718...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>0</td>\n",
       "      <td>[24000.0, 3.0, 9.99, 774.3, 2.0, 7.0, 209276.0...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>0</td>\n",
       "      <td>[16000.0, 3.0, 7.91, 500.72, 1.0, 4.0, 8198.0,...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>0</td>\n",
       "      <td>[6000.0, 3.0, 10.49, 194.99, 2.0, 6.0, 115263....</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>0</td>\n",
       "      <td>[10375.0, 5.0, 15.61, 250.16, 4.0, 15.0, 74728...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   isDefault                                           features\n",
       "0          1  [35000.0, 5.0, 19.52, 917.97, 5.0, 21.0, 16128...\n",
       "1          0  [18000.0, 5.0, 18.49, 461.9, 4.0, 16.0, 89538....\n",
       "2          0  [12000.0, 5.0, 16.99, 298.17, 4.0, 17.0, 15936...\n",
       "3          0  [2050.0, 3.0, 7.69, 63.95, 1.0, 3.0, 59830.0, ...\n",
       "4          0  [11500.0, 3.0, 14.98, 398.54, 3.0, 12.0, 85242...\n",
       "5          1  [12000.0, 3.0, 12.99, 404.27, 3.0, 11.0, 65718...\n",
       "6          0  [24000.0, 3.0, 9.99, 774.3, 2.0, 7.0, 209276.0...\n",
       "7          0  [16000.0, 3.0, 7.91, 500.72, 1.0, 4.0, 8198.0,...\n",
       "8          0  [6000.0, 3.0, 10.49, 194.99, 2.0, 6.0, 115263....\n",
       "9          0  [10375.0, 5.0, 15.61, 250.16, 4.0, 15.0, 74728..."
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train_data.limit(10).toPandas()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "4223dd3e-e9fd-4589-91bf-c1436dcf30d7",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "612742"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train_data.count()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "e84bdc55-6cbe-49bc-bd1b-6893583a6b9a",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                                \r"
     ]
    },
    {
     "data": {
      "text/plain": [
       "119541"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train_data[train_data[params['label']]==1].count()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "019b4d79-b8d4-4945-aa36-0f3524191d71",
   "metadata": {},
   "outputs": [],
   "source": [
    "train, valid = train_data.randomSplit([0.80, 0.20], seed=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b4c1eb9b-2af7-4f41-aa02-c26777137e2d",
   "metadata": {},
   "source": [
    "### Train and evaluation\n",
    "In this section, we will use Synapse LightGBM to build our binary classification model. The meaning of model hyper parameters can be referred to:\n",
    "* https://mmlspark.blob.core.windows.net/docs/0.9.5/pyspark/synapse.ml.lightgbm.html#module-synapse.ml.lightgbm.LightGBMClassifier\n",
    "* https://lightgbm.readthedocs.io/en/latest/Parameters.html\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "4e312d44-b3df-460d-894e-66a88d82b26d",
   "metadata": {},
   "outputs": [],
   "source": [
    "hyper_params =  {\n",
    "    'boostingType':'gbdt',\n",
    "    'objective':'binary',\n",
    "    'metric':'auc',\n",
    "    'numLeaves': 2**5,\n",
    "    'lambdaL1':10,\n",
    "    'lambdaL2':10,\n",
    "    'maxDepth':-1,\n",
    "    'minDataInLeaf':20,\n",
    "    'minSumHessianInLeaf':0.001,\n",
    "    'minGainToSplit':0.0,\n",
    "    'featureFraction':0.8,\n",
    "    'baggingFraction':0.8,\n",
    "    'baggingFreq':4,\n",
    "    'learningRate':0.1,\n",
    "    'numIterations':500,\n",
    "    'earlyStoppingRound':100,\n",
    "    'verbosity':1,\n",
    "    'numThreads':16,\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "96024dac-eb2b-4232-ba46-1d4d9aef8cf5",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                                \r"
     ]
    }
   ],
   "source": [
    "from synapse.ml.lightgbm import LightGBMClassifier\n",
    "model = LightGBMClassifier(isProvideTrainingMetric=True, featuresCol=\"features\", labelCol=\"isDefault\", isUnbalance=True, **hyper_params)\n",
    "model = model.fit(train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "f8894099-f030-4fbe-85b2-3eefb77a88d3",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "train dataset prediciton\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "22/05/31 11:58:05 WARN DAGScheduler: Broadcasting large task binary with size 1844.3 KiB\n",
      "22/05/31 11:58:06 WARN DAGScheduler: Broadcasting large task binary with size 1857.0 KiB\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+---------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------+----------------------------------------+----------+\n",
      "|isDefault|features                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |rawPrediction                               |probability                             |prediction|\n",
      "+---------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------+----------------------------------------+----------+\n",
      "|0        |[750.0,3.0,12.29,25.02,3.0,14.0,89346.0,0.0,1.0,19000.0,0.0,4.0,257.0,22.0,24.0,0.0,705.0,709.0,8.0,0.0,0.0,12220.0,84.9,8.0,1.0,0.0,2000.0,8188.0,1.0,0.0,3.0,5.0,5.0,4.0,7.0,7.0,7.0,13.0,5.0,11.0,0.0,0.0,0.0,2.0,245.0,0.22452249131030422,0.2622187489824493,1.1256065037456688,2.3954313805611767,1.1271790305577534,2.3894389593482432,1.1280401590419524,2.379891395761796,1.1214959723820483,2.368874389943421,1.133487280242913,2.3543743736785854,1.1366994292961319,2.3242310396307224,1.1256130501770054,2.3547907759584903,1.1384052131871212,2.3145797340762306,1.1281051474502122,2.3806458523545233,1.1281723580273009,2.316906033448023,1.1116859020723922,2.405893778765448,1.1116859020723922,2.405893778765448,1.1144363408729232,2.403211478824198,1.1138274763789497,2.395074650443236]                  |[-0.018927164784685497,0.018927164784685497]|[0.49526835005754877,0.5047316499424512]|1.0       |\n",
      "|0        |[1000.0,3.0,5.31,30.12,1.0,0.0,66903.0,10.0,0.0,31000.0,1.0,4.0,12.0,21.0,5.73,0.0,695.0,699.0,10.0,0.0,0.0,25993.0,85.2,18.0,0.0,0.0,2005.0,16334.0,1.0,2.0,2.0,4.0,4.0,2.0,2.0,1.0,9.0,14.0,4.0,10.0,0.0,0.0,0.0,5.0,3926.0,0.05983754010496838,0.03261699574560925,0.35176306740451335,0.8226602972585957,0.3749239550119379,0.8107752991923961,0.3779946371269067,0.8132483118785454,0.3604087873868703,0.8106915661017122,0.33924077181208057,0.8149255575637823,0.36303461811918564,0.8304179963867504,0.37175279865051375,0.8069062892299399,0.3743338823757388,0.8086817046807167,0.3778191763446827,0.813779936124784,0.3670894712781709,0.8030288738604626,0.37056196735746405,0.8019645929218161,0.37056196735746405,0.8019645929218161,0.37147878029097436,0.8010704929413993,0.3187290508053279,0.7804945441447098]|[1.6116881438149608,-1.6116881438149608]    |[0.8336456310903819,0.1663543689096181] |0.0       |\n",
      "|0        |[1000.0,3.0,5.32,30.12,1.0,0.0,15407.0,5.0,1.0,52000.0,0.0,4.0,649.0,48.0,24.23,0.0,760.0,764.0,5.0,0.0,0.0,6595.0,41.5,20.0,0.0,0.0,2005.0,16334.0,1.0,0.0,2.0,2.0,2.0,2.0,2.0,16.0,2.0,4.0,2.0,5.0,0.0,0.0,0.0,0.0,3714.0,0.05983754010496838,0.03261699574560925,0.37520216791522293,0.7984771268537255,0.3749239550119379,0.8107752991923961,0.38845873726338714,0.8046836937671467,0.3604087873868703,0.8106915661017122,0.33924077181208057,0.8149255575637823,0.37069586514911396,0.7821725989870186,0.36144809582309584,0.8061567154891175,0.34454801736732715,0.815282614973269,0.38766365363982525,0.8049824492900666,0.3654239756903025,0.8327009988028357,0.37056196735746405,0.8019645929218161,0.37056196735746405,0.8019645929218161,0.37147878029097436,0.8010704929413993,0.4230777230956115,0.876646576128157]|[3.274646999508907,-3.274646999508907]      |[0.9635487383242547,0.03645126167574527]|0.0       |\n",
      "+---------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------+----------------------------------------+----------+\n",
      "only showing top 3 rows\n",
      "\n",
      "Debug --- train metrics:\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                                \r"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "train dataset auc: 0.7706209993721579\n"
     ]
    }
   ],
   "source": [
    "print(\"train dataset prediciton\")\n",
    "predictions = model.transform(train)\n",
    "predictions.show(3, False)\n",
    "\n",
    "from pyspark.ml.evaluation import BinaryClassificationEvaluator\n",
    "evaluator = pyspark.ml.evaluation.BinaryClassificationEvaluator(labelCol=\"isDefault\",metricName=\"areaUnderROC\")\n",
    "auc = evaluator.evaluate(predictions)\n",
    "print(\"train dataset auc:\", auc)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "e8ce3a53-5bb4-43fe-a7c8-b1f12dcc6c92",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "validation dataset prediciton\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "22/05/31 11:58:13 WARN DAGScheduler: Broadcasting large task binary with size 1844.3 KiB\n",
      "22/05/31 11:58:15 WARN DAGScheduler: Broadcasting large task binary with size 1857.0 KiB\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+---------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------+----------------------------------------+----------+\n",
      "|isDefault|features                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |rawPrediction                           |probability                             |prediction|\n",
      "+---------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------+----------------------------------------+----------+\n",
      "|0        |[1000.0,3.0,6.08,30.46,1.0,1.0,211367.0,10.0,0.0,42000.0,1.0,5.0,585.0,13.0,6.26,1.0,785.0,789.0,11.0,0.0,0.0,6.0,0.0,38.0,0.0,0.0,1997.0,21154.0,1.0,0.0,1.0,1.0,1.0,5.0,14.0,15.0,9.0,22.0,1.0,11.0,0.0,0.0,0.0,0.0,3898.0,0.05983754010496838,0.04449302765303711,0.37520216791522293,0.7984771268537255,0.36823864614632684,0.796491183373543,0.3884317619877589,0.7800669189561086,0.37368843646656963,0.8045531863602443,0.39435858025218457,0.8095334511433696,0.3711081089572402,0.7890689212109708,0.37175279865051375,0.8069062892299399,0.38042707916063234,0.7965964511914787,0.38796861377506536,0.7819288393507442,0.3760574526757669,0.7723020111493409,0.37056196735746405,0.8019645929218161,0.37056196735746405,0.8019645929218161,0.37147878029097436,0.8010704929413993,0.4230777230956115,0.876646576128157]|[2.367208907513374,-2.367208907513374]  |[0.914292398265566,0.08570760173443401] |0.0       |\n",
      "|0        |[1000.0,3.0,7.89,31.29,1.0,4.0,207844.0,2.0,0.0,24000.0,0.0,0.0,296.0,19.0,13.25,0.0,755.0,759.0,9.0,0.0,0.0,706.0,10.4,11.0,0.0,0.0,2010.0,0.0,1.0,0.0,0.0,2.0,2.0,1.0,1.0,7.0,3.0,3.0,2.0,9.0,0.0,0.0,0.0,2.0,3014.0,0.05983754010496838,0.08352181466548661,0.37520216791522293,0.7984771268537255,0.3401902427637722,0.7495474044598167,0.38845873726338714,0.8046836937671467,0.3473557259603269,0.8050463321281104,0.3232406356413167,0.8071942495730369,0.37889980976537735,0.7747436798769075,0.36566792943787546,0.8066822881025626,0.33803383017919947,0.8026451100309853,0.38766365363982525,0.8049824492900666,0.36800276434001383,0.8091379450703073,0.37056196735746405,0.8019645929218161,0.37056196735746405,0.8019645929218161,0.37147878029097436,0.8010704929413993,0.3712758254596499,0.7983582168144119]    |[2.376691398393738,-2.376691398393738]  |[0.9150325501518889,0.08496744984811107]|0.0       |\n",
      "|0        |[1000.0,3.0,7.89,31.29,1.0,4.0,242774.0,1.0,2.0,62652.0,2.0,0.0,85.0,14.0,16.37,1.0,710.0,714.0,7.0,0.0,0.0,6328.0,29.2,17.0,0.0,0.0,2001.0,0.0,1.0,0.0,0.0,4.0,4.0,1.0,2.0,3.0,6.0,14.0,4.0,7.0,0.0,0.0,0.0,1.0,3105.0,0.05983754010496838,0.08352181466548661,0.37520216791522293,0.7984771268537255,0.3401902427637722,0.7495474044598167,0.3779946371269067,0.8132483118785454,0.3473557259603269,0.8050463321281104,0.33924077181208057,0.8149255575637823,0.36852575437795904,0.8191263572567583,0.3693485372362285,0.8125048208706459,0.3743338823757388,0.8086817046807167,0.3778191763446827,0.813779936124784,0.3684419017437476,0.8217833106306774,0.37056196735746405,0.8019645929218161,0.37056196735746405,0.8019645929218161,0.37147878029097436,0.8010704929413993,0.39513518081325033,0.8461114712920018]       |[1.6383254143919224,-1.6383254143919224]|[0.8373069474638497,0.16269305253615035]|0.0       |\n",
      "+---------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------+----------------------------------------+----------+\n",
      "only showing top 3 rows\n",
      "\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                                \r"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "validation dataset auc: 0.7321580860915377\n"
     ]
    }
   ],
   "source": [
    "print(\"validation dataset prediciton\")\n",
    "predictions = model.transform(valid)\n",
    "predictions.show(3, False)\n",
    "\n",
    "from pyspark.ml.evaluation import BinaryClassificationEvaluator\n",
    "evaluator = pyspark.ml.evaluation.BinaryClassificationEvaluator(labelCol=\"isDefault\",metricName=\"areaUnderROC\")\n",
    "auc = evaluator.evaluate(predictions)\n",
    "print(\"validation dataset auc:\", auc)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "66c89954-4163-4385-acbb-8ec40744245c",
   "metadata": {},
   "source": [
    "### Feature Importance"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "e12615ac-1cc0-4b2c-9c97-c5af1528f75d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "            feature_name  importance_gain  importance_split\n",
      "0               subGrade     83997.066126             144.0\n",
      "1        grade_to_std_n4     81705.719417             140.0\n",
      "2       grade_to_mean_n4     76956.117708             112.0\n",
      "3   subGrade_target_mean     57321.622027              32.0\n",
      "4            issueDateDT     57033.936851             941.0\n",
      "5                    dti     29891.142450             798.0\n",
      "6           annualIncome     29267.896991             780.0\n",
      "7                   term     28744.529145             124.0\n",
      "8               loanAmnt     28052.935027             921.0\n",
      "9       grade_to_mean_n7     27357.709897             129.0\n",
      "10              revolBal     21181.492528             741.0\n",
      "11           installment     20755.766657             649.0\n",
      "12       employmentTitle     20452.801992             713.0\n",
      "13         homeOwnership     19117.934570             209.0\n",
      "14          ficoRangeLow     17453.734251             408.0\n",
      "15            regionCode     17303.523351             647.0\n",
      "16     grade_to_mean_n10     15893.145414             167.0\n",
      "17             revolUtil     14771.774867             687.0\n",
      "18      grade_to_mean_n5     13272.692631             128.0\n",
      "19     earliesCreditLine     11562.611207             464.0\n",
      "20              postCode     11471.511600             594.0\n",
      "21                    n2     10326.814517             188.0\n",
      "22          interestRate     10266.527787             326.0\n",
      "23              totalAcc      9778.100644             383.0\n",
      "24                   n14      9291.483289             211.0\n",
      "25       grade_to_std_n6      8286.657939             221.0\n",
      "26                    n6      7189.062676             299.0\n",
      "27      employmentLength      6978.116453             297.0\n",
      "28       grade_to_std_n8      6721.618635             247.0\n",
      "29                 title      5509.364440             246.0\n",
      "30                    n8      5407.500285             249.0\n",
      "31    verificationStatus      4034.273176             129.0\n",
      "32      grade_to_mean_n6      3975.570632             194.0\n",
      "33                    n5      3883.344524             202.0\n",
      "34       grade_to_std_n5      3703.376327             192.0\n",
      "35      grade_to_mean_n8      3604.139187             167.0\n",
      "36               openAcc      3315.712104             174.0\n",
      "37               purpose      3258.018200             132.0\n",
      "38      grade_to_std_n10      3228.527757             167.0\n",
      "39       grade_to_std_n7      3001.766988             162.0\n",
      "40                    n7      2963.788101             157.0\n",
      "41                    n9      2845.468110             114.0\n",
      "42                    n4      2803.227510             137.0\n",
      "43                   n10      2376.972429             128.0\n",
      "44                    n1      2295.242034             112.0\n",
      "45       grade_to_std_n1      2182.863192             120.0\n",
      "46                    n3      2085.303511              39.0\n",
      "47       grade_to_std_n2      2045.537362             114.0\n",
      "48         ficoRangeHigh      1968.884506              59.0\n",
      "49     grade_to_mean_n14      1962.593323             100.0\n",
      "50      grade_to_mean_n1      1950.999537              93.0\n",
      "51      grade_to_mean_n2      1817.888710              72.0\n",
      "52    delinquency_2years      1694.611217              76.0\n",
      "53       grade_to_std_n9      1573.347364              81.0\n",
      "54                pubRec      1380.390000              52.0\n",
      "55      grade_to_mean_n9      1266.019758              61.0\n",
      "56    pubRecBankruptcies      1207.817330              47.0\n",
      "57                    n0      1124.332259              65.0\n",
      "58      grade_to_mean_n0       909.541806              51.0\n",
      "59     initialListStatus       719.327804              39.0\n",
      "60      grade_to_std_n14       685.976515              33.0\n",
      "61       grade_to_std_n0       560.768220              24.0\n",
      "62                   n13       100.508930               6.0\n",
      "63      grade_to_std_n13        72.919600               4.0\n",
      "64     grade_to_mean_n13        33.911000               2.0\n",
      "65                   n12         0.000000               0.0\n",
      "66                   n11         0.000000               0.0\n",
      "67       applicationType         0.000000               0.0\n",
      "68            policyCode         0.000000               0.0\n",
      "69     grade_to_mean_n11         0.000000               0.0\n",
      "70      grade_to_std_n11         0.000000               0.0\n",
      "71     grade_to_mean_n12         0.000000               0.0\n",
      "72      grade_to_std_n12         0.000000               0.0\n",
      "73     grade_target_mean         0.000000               0.0\n",
      "74                 grade         0.000000               0.0\n"
     ]
    }
   ],
   "source": [
    "importance_df = (\n",
    "    pd.DataFrame({\n",
    "        'feature_name': feature_cols,\n",
    "        'importance_gain': model.getFeatureImportances('gain'),\n",
    "        'importance_split': model.getFeatureImportances('split'),\n",
    "    })\n",
    "    .sort_values('importance_gain', ascending=False)\n",
    "    .reset_index(drop=True)\n",
    ")\n",
    "print(importance_df)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "857a8b4c-cc27-48af-8c4a-abef0f5a39fd",
   "metadata": {},
   "source": [
    "### Hyper Parameters Tuning\n",
    "In this section, we will use `Hyperopt` to tune the hyper parameters of the lightgbm model. Since the hyper parameter combination space is very large, we only demonstrate the search process of the `learningRate` and `numIterations`. If there is enough computing resources, more larger space and iteration rounds can be utilized to get better performance.\n",
    "\n",
    "Here we use `hyperopt.tpe.suggest`, a Bayesian approach to search in the parameter combination space. For more information, please refer to the Jupyter Notebook of Databriks and the documentation of Hyperopt:\n",
    " * https://docs.databricks.com/_static/notebooks/hyperopt-spark-ml.html\n",
    " * http://hyperopt.github.io/hyperopt/#algorithms"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "12e0d930-6b74-422d-9850-839ff8de7a77",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from hyperopt import fmin, tpe, hp, Trials, STATUS_OK\n",
    "from pyspark.ml.evaluation import BinaryClassificationEvaluator\n",
    "\n",
    "# hyper paramaters template\n",
    "hyper_params =  {\n",
    "    'boostingType':'gbdt',\n",
    "    'objective':'binary',\n",
    "    'metric':'auc',\n",
    "    'numLeaves': 2**5,\n",
    "    'lambdaL1':10,\n",
    "    'lambdaL2':10,\n",
    "    'maxDepth':-1,\n",
    "    'minDataInLeaf':20,\n",
    "    'minSumHessianInLeaf':0.001,\n",
    "    'minGainToSplit':0.0,\n",
    "    'featureFraction':0.8,\n",
    "    'baggingFraction':0.8,\n",
    "    'baggingFreq':4,\n",
    "    'learningRate':0.1,\n",
    "    'numIterations':500,\n",
    "    'earlyStoppingRound':100,\n",
    "    'verbosity':1,\n",
    "    'numThreads':16,\n",
    "}\n",
    "\n",
    "# define a function to minimize\n",
    "def train_with_hyperopt(params, hyper_params, train, valid):\n",
    "    \"\"\"\n",
    "    An example train method that calls into MLlib.\n",
    "    This method is passed to hyperopt.fmin().\n",
    "\n",
    "    :param params: hyperparameters as a dict. Its structure is consistent with how search space is defined. See below.\n",
    "    :return: dict with fields 'loss' (scalar loss) and 'status' (success/failure status of run)\n",
    "    \"\"\"\n",
    "    # For integer parameters, make sure to convert them to int type if Hyperopt is searching over a continuous range of values.\n",
    "    hyper_params['learningRate'] = params['learningRate']\n",
    "    hyper_params['numIterations'] = int(params['numIterations'])\n",
    "    # train lightgbm model\n",
    "    model = LightGBMClassifier(isProvideTrainingMetric=True, featuresCol=\"features\", labelCol=\"isDefault\", isUnbalance=True, **hyper_params)\n",
    "    model = model.fit(train)\n",
    "    # transform validation dataset\n",
    "    predictions = model.transform(valid)\n",
    "    evaluator = BinaryClassificationEvaluator(labelCol=\"isDefault\",metricName=\"areaUnderROC\")\n",
    "    # evaluate auc\n",
    "    auc = evaluator.evaluate(predictions)\n",
    "    # Hyperopt expects you to return a loss (for which lower is better), so take the negative of the f1_score (for which higher is better).\n",
    "    return {'loss': -auc, 'status': STATUS_OK}\n",
    "\n",
    "# define the search space over hyperparameters\n",
    "space = {\n",
    "  'learningRate': hp.uniform('learningRate', 0.01, 0.1),\n",
    "  'numIterations': hp.uniform('numIterations', 500, 2000),\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "6f707943-a382-4f2f-bad5-7259af66042f",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  0%|          | 0/5 [00:00<?, ?trial/s, best loss=?]"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "22/05/31 12:01:32 WARN DAGScheduler: Broadcasting large task binary with size 6.5 MiB\n",
      "                                                                                \r"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " 20%|██        | 1/5 [03:18<13:12, 198.06s/trial, best loss: -0.7316463645843494]"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "22/05/31 12:04:11 WARN DAGScheduler: Broadcasting large task binary with size 5.4 MiB\n",
      "                                                                                \r"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " 40%|████      | 2/5 [05:56<08:44, 174.93s/trial, best loss: -0.732994601578312] "
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "22/05/31 12:05:20 WARN DAGScheduler: Broadcasting large task binary with size 2.2 MiB\n",
      "                                                                                \r"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " 60%|██████    | 3/5 [07:04<04:11, 125.92s/trial, best loss: -0.732994601578312]"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "22/05/31 12:07:33 WARN DAGScheduler: Broadcasting large task binary with size 4.5 MiB\n",
      "                                                                                \r"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " 80%|████████  | 4/5 [09:17<02:08, 128.93s/trial, best loss: -0.734095579152508]"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "22/05/31 12:10:14 WARN DAGScheduler: Broadcasting large task binary with size 5.6 MiB\n",
      "                                                                                \r"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "100%|██████████| 5/5 [12:00<00:00, 144.02s/trial, best loss: -0.734095579152508]\n"
     ]
    }
   ],
   "source": [
    "from functools import partial\n",
    "best_params = fmin(\n",
    "    fn=partial(train_with_hyperopt, hyper_params=hyper_params, train=train, valid=valid),\n",
    "    space=space,\n",
    "    algo=tpe.suggest,\n",
    "    max_evals=5\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "e8e2f14c-2dd6-4401-a75b-5669e28a1cff",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'learningRate': 0.023764779523899424, 'numIterations': 1273.8549043018331}\n"
     ]
    }
   ],
   "source": [
    "print(best_params)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "10c3444f-258d-47c1-bb01-3f3e15c5aeb6",
   "metadata": {},
   "source": [
    "### Retrain the model on the full training dataset\n",
    "In this section, we should use the full training dataset and evaluate the model effect on the test dataset. However, since we do not have labeled test dateset, we can only use the same traing data as before."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "950ef732-ee1b-49b5-b48f-43fb3be7a5a6",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "22/05/31 12:19:56 WARN DAGScheduler: Broadcasting large task binary with size 4.5 MiB\n",
      "                                                                                \r"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "train dataset auc: 0.7579260854903618\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "22/05/31 12:20:04 WARN DAGScheduler: Broadcasting large task binary with size 4.5 MiB\n",
      "                                                                                \r"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "validation dataset auc: 0.7339783176165244\n"
     ]
    }
   ],
   "source": [
    "hyper_params =  {\n",
    "    'boostingType':'gbdt',\n",
    "    'objective':'binary',\n",
    "    'metric':'auc',\n",
    "    'numLeaves': 2**5,\n",
    "    'lambdaL1':10,\n",
    "    'lambdaL2':10,\n",
    "    'maxDepth':-1,\n",
    "    'minDataInLeaf':20,\n",
    "    'minSumHessianInLeaf':0.001,\n",
    "    'minGainToSplit':0.0,\n",
    "    'featureFraction':0.8,\n",
    "    'baggingFraction':0.8,\n",
    "    'baggingFreq':4,\n",
    "    'learningRate':best_params['learningRate'],\n",
    "    'numIterations':int(best_params['numIterations']),\n",
    "    'earlyStoppingRound':100,\n",
    "    'verbosity':1,\n",
    "    'numThreads':16,\n",
    "}\n",
    "\n",
    "model = LightGBMClassifier(isProvideTrainingMetric=True, featuresCol=\"features\", labelCol=\"isDefault\", isUnbalance=True, **hyper_params)\n",
    "model = model.fit(train)\n",
    "\n",
    "predictions = model.transform(train)\n",
    "evaluator = pyspark.ml.evaluation.BinaryClassificationEvaluator(labelCol=\"isDefault\",metricName=\"areaUnderROC\")\n",
    "auc = evaluator.evaluate(predictions)\n",
    "print(\"train dataset auc:\", auc)\n",
    "\n",
    "predictions = model.transform(valid)\n",
    "evaluator = pyspark.ml.evaluation.BinaryClassificationEvaluator(labelCol=\"isDefault\",metricName=\"areaUnderROC\")\n",
    "auc = evaluator.evaluate(predictions)\n",
    "print(\"validation dataset auc:\", auc)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "9db83c8a-9283-4f17-b163-a1497a62e1fa",
   "metadata": {},
   "outputs": [],
   "source": [
    "spark.stop()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "13abe8c4-bd1f-4aca-b2d4-447151078059",
   "metadata": {},
   "source": [
    "### Acknowledgement\n",
    "Thanks to the Tianchi community for providing the loan default dataset and corresponding tutorial for risk management based on this dataset."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
